Glossary
- antonymy: the antonym relation
- synonymy: the synonym relation
- DST: the dialogue state tracking task
Aim
- Improve the vectors' capability for judging semantic similarity.
Result
- Leads to a new state-of-the-art performance on the SimLex-999 dataset.
- Results in robust improvements across different dialogue domains.
Introduction
This paper comes from the University of Cambridge and Apple.
Traditional word vectors such as GloVe have two drawbacks, which can be addressed by injecting some additional knowledge.
Drawbacks of learning word embeddings from co-occurrence information in corpora:
- they coalesce the notions of semantic similarity and conceptual association
- similarity and antonymy can be application- or domain-specific
The paper proposes a method that addresses these two drawbacks by using synonymy and antonymy relations, drawn from either a general lexical resource or an application-specific ontology, to fine-tune distributional word vectors.
It is a lightweight post-processing procedure in the spirit of retrofitting.
Related Work
Most work on improving word vector representation using lexical resources has focused on bringing words which are known to be semantically related closer together in the vector space.
Some methods modify the prior or the regularization of the original training procedure.
The word vectors that achieve the current state-of-the-art performance on SimLex-999 are used as input for counter-fitting in this paper's experiments.
The modelling work closest to this one is Qun Liu's "Learning semantic word embeddings based on ordinal knowledge constraints". They use antonymy and WordNet hierarchy information to modify the heavyweight Word2Vec training objective.
Counter-fitting Word Vectors to Linguistic Constraints
the original word vectors: $V=(\mathbf{v}_1,\mathbf{v}_2,\ldots,\mathbf{v}_N)$
new word vectors: $V^\prime=(\mathbf{v}^\prime_1,\mathbf{v}^\prime_2,\ldots,\mathbf{v}^\prime_N)$
$A$ and $S$ are two constraint sets, each consisting of word pairs $(i,j)$ (antonym pairs and synonym pairs, respectively). Three objective terms are then defined: the first two pull synonyms closer together and push antonyms further apart,
while the third preserves as much of the distributional information in the original vector space as possible.
Antonym Repel (AR)
$$AR(V^\prime) = \sum_{(u,w) \in A} \tau\left(\delta - d(\mathbf{v}^\prime_u, \mathbf{v}^\prime_w)\right)$$
$d(\mathbf{v}_i,\mathbf{v}_j)=1-\cos(\mathbf{v}_i,\mathbf{v}_j)$ is a distance derived from cosine similarity
$\tau(x) \triangleq \max(0,x)$
The $\delta$ is the "ideal" minimum distance between antonymous words; in this paper $\delta=1$. Since $d(\mathbf{v}_i,\mathbf{v}_j) \in [0,2]$, a pair incurs a cost when $d(\mathbf{v}_i,\mathbf{v}_j) \in [0,1)$, and the cost is zero when $d(\mathbf{v}_i,\mathbf{v}_j) \in [1,2]$ because the two words are already far enough apart.
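A minimal sketch of this term, assuming plain numpy and a dict mapping words to vectors (the helper names `cosine_distance`, `tau`, and `antonym_repel` are mine, not the paper's):

```python
import numpy as np

def cosine_distance(vi, vj):
    """d(v_i, v_j) = 1 - cos(v_i, v_j), so it lies in [0, 2]."""
    return 1.0 - np.dot(vi, vj) / (np.linalg.norm(vi) * np.linalg.norm(vj))

def tau(x):
    """tau(x) = max(0, x): only violated constraints contribute a cost."""
    return max(0.0, x)

def antonym_repel(vectors, antonym_pairs, delta=1.0):
    """AR(V') = sum over (u, w) in A of tau(delta - d(v'_u, v'_w))."""
    return sum(tau(delta - cosine_distance(vectors[u], vectors[w]))
               for u, w in antonym_pairs)
```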
Synonym Attract (SA)
$$SA(V^\prime) = \sum_{(u,w) \in S} \tau\left(d(\mathbf{v}^\prime_u, \mathbf{v}^\prime_w) - \gamma\right)$$
This is similar to AR, with the ideal distance set to $\gamma=0$. But I find this strange: since $d(\mathbf{v}^\prime_u,\mathbf{v}^\prime_w)-\gamma \geq 0$ always holds, every synonym pair incurs a cost. Maybe $\gamma=1$ would be more reasonable.
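Reusing the `cosine_distance` and `tau` helpers from the AR sketch above, the synonym term could look like this (again an illustrative sketch, not the paper's code):

```python
def synonym_attract(vectors, synonym_pairs, gamma=0.0):
    """SA(V') = sum over (u, w) in S of tau(d(v'_u, v'_w) - gamma)."""
    return sum(tau(cosine_distance(vectors[u], vectors[w]) - gamma)
               for u, w in synonym_pairs)
```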
Vector Space Preservation (VSP)
$$VSP(V,V^\prime) = \sum_{i=1}^{N} \sum_{j \in N(i)} \tau\left(d(\mathbf{v}_i^\prime, \mathbf{v}_j^\prime) - d(\mathbf{v}_i, \mathbf{v}_j)\right)$$
This formula also seems strange to me: if two words are pulled closer together than in the original space, the cost does not increase; only stretching a pair further apart than it was originally is penalised.
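Again reusing the helpers above, a sketch of this term, where `neighbours[i]` stands for the set $N(i)$ of words close to word $i$ in the original space (the data structure is my assumption):

```python
def vector_space_preservation(original, updated, neighbours):
    """VSP(V, V') = sum_i sum_{j in N(i)} tau(d(v'_i, v'_j) - d(v_i, v_j))."""
    total = 0.0
    for i, neigh in neighbours.items():
        for j in neigh:
            total += tau(cosine_distance(updated[i], updated[j])
                         - cosine_distance(original[i], original[j]))
    return total
```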
The three objective terms are then combined linearly:
$$C(V,V^\prime) = k_1 AR(V^\prime) + k_2 SA(V^\prime) + k_3 VSP(V,V^\prime)$$
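Putting the three sketched terms together (the weights $k_1, k_2, k_3$ below are placeholders, not the paper's values; the paper minimises this objective with stochastic gradient descent):

```python
def counter_fit_cost(original, updated, antonym_pairs, synonym_pairs,
                     neighbours, k1=1.0, k2=1.0, k3=1.0):
    """C(V, V') = k1 * AR(V') + k2 * SA(V') + k3 * VSP(V, V')."""
    return (k1 * antonym_repel(updated, antonym_pairs)
            + k2 * synonym_attract(updated, synonym_pairs)
            + k3 * vector_space_preservation(original, updated, neighbours))
```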
Injecting Dialogue Domain Ontologies into Vector Space Representations
They use an RNN framework that operates directly on the n-gram features extracted from the automatic speech recognition hypotheses.
Experiments
Word Vectors and Semantic Lexicons
GloVe and Paragram-SL999 vectors are publicly available.
Constraints from two lexical resources are used:
- PPDB 2.0: only the Equivalence and Exclusion relations are used, and only single-token entries
- WordNet: only its antonyms are used; its synonyms are not
The vocabulary is restricted to the most frequent words, taken from a frequent word list.
Improving Lexical Similarity Predictions
Evaluation uses Spearman's rank correlation coefficient on the SimLex-999 dataset, which contains word pairs ranked by a large number of annotators instructed to consider only semantic similarity.
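A sketch of such an evaluation with scipy, reusing `cosine_distance` from above (file-format handling is omitted, and word pairs missing from the vocabulary are simply skipped here, which is my simplification):

```python
from scipy.stats import spearmanr

def simlex_correlation(vectors, simlex_pairs):
    """Spearman's rho between model cosine similarities and human SimLex-999 scores.

    `simlex_pairs` is a list of (word1, word2, human_score) triples.
    """
    model_scores, human_scores = [], []
    for w1, w2, score in simlex_pairs:
        if w1 in vectors and w2 in vectors:
            model_scores.append(1.0 - cosine_distance(vectors[w1], vectors[w2]))
            human_scores.append(score)
    rho, _ = spearmanr(model_scores, human_scores)
    return rho
```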
Retrofitting pre-trained word vectors improves the GloVe vectors, but not the already semantically specialised Paragram-SL999 vectors. Counter-fitting substantially improves both sets of vectors, showing that injecting antonymy relations goes a long way towards improving word vectors for the purpose of making semantic similarity judgements.
Table 3 shows the effect of injecting different categories of linguistic constraints. Three kinds of constraints are considered:
- PPDB- (PPDB antonyms)
- PPDB+ (PPDB synonyms)
- WordNet- (WordNet antonyms)
GloVe vectors benefit from all three sets of constraints. Paragram vectors, which were already exposed to PPDB during training, improve only with the injection of WordNet antonyms.
Table 4 shows that counter-fitting corrects 8 word pairs in SimLex-999. Five of the eight pairs do not appear in the sets of linguistic constraints, which shows that secondary (i.e. indirect) interactions through the three terms of the cost function do contribute to the semantic content of the transformed vector space.
Improving Dialogue State Tracking
In this experiment, starting from Paragram vectors did not lead to superior dialogue state tracking performance, which shows that injecting the application-specific ontology is at least as important as the quality of the initial word vectors.